Abstract

A Parallel Programming Archetype is a language-independent program
design strategy. We describe two archetypes in combinatorics and
optimization, their components, implementations, and example applications
developed using an archetype.
Chapter 1


Introduction 


1.1  Goal 

The goal of this thesis is to study two archetypes in combinatorics and
optimization, the Divide-and-Conquer Archetype and the Branch and Bound
Archetype, and to demonstrate how these archetypes can be used for the
systematic design of efficient sequential and parallel programs. The research
whose results are presented in this document is part of the ongoing project
on Parallel Programming Archetypes.


1.2  Motivation 

As networks of workstations and distributed systems that can exploit
parallel computing become more widespread, the need for tools to aid the
development of parallel programs grows as well. Many scientists do not take
advantage of the available hardware because the effort required to develop
a new parallel program from scratch, or to parallelize existing code, is
often not justified by the potential speedup. We think that it is possible
to reduce this effort by using parallel programming archetypes.

Many parallel applications share common features in design, program
structure, communication pattern, reasoning, debugging, testing and
performance tuning. This allows for the development of an archetype as an
abstraction that embodies these common characteristics. Understanding
an archetype helps a programmer to understand all algorithms contained
within the archetype’s application domain. Moreover, knowledge of several
archetypes forms a basis for a systematic approach to program design for a
new problem.

Archetypes are being developed as language-independent design methods,
allowing users to choose programming languages and communication
libraries, as well as familiar programming environments and debugging tools.
For many archetypes most of the debugging and testing can be done on a
sequential program, from which a correct and efficient parallel program can
be easily derived.

1.3  What  Are  Archetypes? 

An archetype is a method of problem solving characterized by a design
strategy. It consists of several parts including:

1.  the  structure  of  a  class  of  programs, 

2.  methods  of  developing  parallel  and  sequential  applications, 

3.  frameworks  for  reasoning  about  correctness, 

4. suggestions for test suites and debugging, and tips based on the
experience of others,

5. suggestions for performance tuning and performance models for
different architectures, and

6.  a  collection  of  applications  developed  using  an  archetype,  each  with  a 
collection  of  programs  in  several  programming  languages. 

Currently  archetypes  are  being  developed  in  two  areas:  archetypes  for 
scientific  applications  [5,  10]  such  as  the  Mesh  Archetype  and  the  Spectral 
Methods  Archetypes,  and  archetypes  for  combinatorics  and  optimization, 
such  as  the  Branch  and  Bound  Archetype  and  the  Dynamic  Programming 
Archetype. 

1.4  Architectures,  Languages  and  Libraries 

Many different architectures are available today for the development of
concurrent programs: heterogeneous or homogeneous networks of workstations
or PCs, massively parallel supercomputers, and multiprocessor workstations.
These architectures have different characteristics and may require
different programming approaches in order to achieve good performance. In
order to show how the archetypal approach can be used for different
architectures, we will consider two major architectures: networks of
single-processor computers, such as workstations or PCs, and a
supercomputer. The former can be characterized by a low ratio of
communication speed to computational speed, whereas the latter can be
characterized by high communication speed. Most programs in this report
were written for a network of Sun SPARCstations and for the Intel
Touchstone Delta.

Archetypes are being developed as a language-independent approach to
sequential and parallel program design, allowing programmers to take
advantage of the special features of parallel languages and communication
libraries. In this report we used the C programming language together with
two communication libraries and a task-parallel language:

PVM: Programs developed for networks of workstations were written using
the Parallel Virtual Machine (PVM) system developed at Oak Ridge
National Laboratory and the University of Tennessee [7]. The system
uses message passing to exploit parallel computing across a wide
variety of distributed systems, including networks of workstations and
massively parallel computers.

NX:  Programs  developed  for  the  Intel  Touchstone  Delta  were  written  in 
the  C  programming  language  using  the  NX  communication  library  for 
message-passing  between  processes  [8]. 

CC++: Several programs were written in Compositional C++ (CC++),
a parallel programming language based on the C++ programming
language and developed at Caltech [3, 12]. CC++ has a few simple
extensions to allow construction of parallel libraries on a variety of
architectures.


1.5  Overview 

This  report  is  divided  into  2  chapters,  each  of  which  describes  a  different 
programming  archetype.  Each  chapter  contains  the  following  sections: 

Introduction: an intuitive description of an archetype’s approach and
programs in the archetype’s application domain.

Archetype Skeleton: the archetype’s strategy, presented in a more formal
way, with the skeleton of the main procedure.

Archetype Components: the specification of user-defined functions,
procedures and data types.

Approaches  to  Parallel  Implementation:  one  or  more  approaches  to 
parallel  implementation  of  an  archetype. 

Implementations:  description  of  several  implementations  of  the  archetype’s 
programming  template. 

Applications:  description  of  one  or  more  applications  from  an  archetype’s 
domain. 

The following points concerning assertional and coding notation should
be noted:

•  All  data  type  names  used  throughout  the  report  end  with  “_t”. 

•  In  some  cases,  the  names  of  the  procedures  are  also  used  as  predicates 
in  the  assertions.  Such  a  predicate  holds  if  and  only  if  the  execution 
of  the  procedure  with  the  given  input  parameters  produces  the  given 
output parameters. For example, if procedure proc(x, y) has input
parameter x and output parameter y, then predicate proc(a, b) holds
if and only if, after execution of proc(a, y), y = b.
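
The convention of using a procedure name as a predicate can be illustrated
with a minimal C sketch. The body of proc() (doubling its input) and the
helper name proc_holds() are hypothetical and chosen only for illustration;
the report itself does not define them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical procedure: x is the input parameter, y the output parameter. */
void proc(int x, int *y)
{
    *y = 2 * x;
}

/* Predicate proc(a, b): holds iff executing proc(a, y) leaves y equal to b. */
bool proc_holds(int a, int b)
{
    int y;
    proc(a, &y);
    return y == b;
}
```

With this choice of proc(), the predicate proc(3, 6) holds and proc(3, 7)
does not.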
Chapter 2


The Divide-and-Conquer
Archetype

2.1  Introduction 

The Divide-and-Conquer Archetype is based on a well-known strategy [2, 11]
for solving large problems when there exists an algorithm for solving
smaller problems of the same type. Although both recursive and
non-recursive implementations of this approach exist, it is usually
described as the following recursive algorithm: when given a large problem,

1. take the problem and divide it into strictly smaller problems of the
same type; continue dividing until a problem is reached that is small
enough to be solved directly. A problem of such size is usually called
a base-case problem, or simply a base-case;

2. upon reaching a base-case problem, solve it using some known
algorithm;

3. then take the solutions to the smaller problems and merge them into
the solutions to larger problems, until a solution to the original
problem is obtained.

Whether this approach produces an efficient sequential algorithm
depends on the ability to split a problem and/or recombine subsolutions
in an efficient manner. However, even for problems for which the
Divide-and-Conquer strategy does not yield an efficient sequential
solution, the archetype might provide an efficient parallel solution.

2.2 Archetype Skeleton

The outline of the sequential Divide-and-Conquer Archetype is as follows:

Solution_t DnC(Problem_t aProblem)
{
    Solution_t aSolution;

    if (Size(aProblem) <= BaseCaseSize)
        /* small enough to be solved directly */
        aSolution = BaseCaseSolution(aProblem);
    else {
        Problem_t subProblems[K];
        OtherInfo_t Other;
        Solution_t subSolutions[K];
        int i;

        Split(aProblem, subProblems, Other);
        for (i = 0; i < K; i++)
            subSolutions[i] = DnC(subProblems[i]);
        aSolution = Merge(subSolutions, Other);
    }
    return aSolution;
}

Constants K and BaseCaseSize, and functions Size(), BaseCaseSolution(),
Split() and Merge() are described in the following sections.
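
As an illustration of how the skeleton can be instantiated, the following
minimal C sketch finds the minimum element of an integer array. The data
layout (a slice given by a pointer and a length), the choices K = 2 and
BaseCaseSize = 1, and the omission of the optional OtherInfo_t argument are
assumptions made for this example only. The sketch treats problems of size
no larger than BaseCaseSize as base cases.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed parameters for this example: binary split, one-element base case. */
enum { K = 2, BaseCaseSize = 1 };

typedef struct { const int *data; size_t len; } Problem_t; /* array slice   */
typedef int Solution_t;                                    /* minimum value */

static size_t Size(Problem_t p) { return p.len; }

/* A one-element slice is its own minimum. */
static Solution_t BaseCaseSolution(Problem_t p) { return p.data[0]; }

/* Split a slice of length >= 2 into two non-empty, strictly smaller halves. */
static void Split(Problem_t p, Problem_t sub[K])
{
    size_t half = p.len / 2;
    sub[0].data = p.data;        sub[0].len = half;
    sub[1].data = p.data + half; sub[1].len = p.len - half;
}

/* The minimum of the whole slice is the smaller of the two sub-minima. */
static Solution_t Merge(const Solution_t s[K])
{
    return s[0] < s[1] ? s[0] : s[1];
}

Solution_t DnC(Problem_t aProblem)
{
    assert(aProblem.data != NULL && aProblem.len >= 1); /* well-formedness */
    if (Size(aProblem) <= BaseCaseSize)
        return BaseCaseSolution(aProblem);

    Problem_t subProblems[K];
    Solution_t subSolutions[K];
    Split(aProblem, subProblems);
    for (int i = 0; i < K; i++)
        subSolutions[i] = DnC(subProblems[i]);
    return Merge(subSolutions);
}
```

Only Size(), BaseCaseSolution(), Split() and Merge() are specific to the
problem; the body of DnC() follows the skeleton unchanged.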

2.3  Archetype  Components 

In general, in order to develop a sequential Divide-and-Conquer algorithm,
the user has to define three data types and several functions and
procedures on these data types. Additionally, for specification and
reasoning purposes, several predicates should be defined.

Data  types: 

Problem_t is a user-defined data type representing a problem being solved.

Solution_t is a user-defined data type representing a solution to some
problem of type Problem_t.

OtherInfo_t is an optional user-defined data type representing some
information which is produced by the Split() procedure in addition to the
subproblems, and then used by the Merge() procedure. Many
Divide-and-Conquer algorithms do not use this data type.

Predicates: For specification and reasoning purposes only, the user should
define two predicates on the variables of the user-defined data types:

isWellFormed(P) holds if and only if problem P is a well formed problem.
The reason for having this predicate is that programming languages
allow variables of certain types to have many values, some of which
are not well formed in the context of the problem being solved.

isSolution(P, S) holds if and only if solution S is a correct solution to
problem P.
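
Although the predicates are meant for specification and reasoning only,
writing them as executable checks can help during testing. The sketch below
does this for a hypothetical "minimum element of an array" problem; the
representation of Problem_t and Solution_t is an assumption of the example,
not part of the archetype.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { const int *data; size_t len; } Problem_t; /* assumed layout */
typedef int Solution_t;

/* A problem is well formed if it refers to a non-empty array. */
bool isWellFormed(Problem_t p)
{
    return p.data != NULL && p.len >= 1;
}

/* s solves p iff s occurs in the array and no element is smaller than s. */
bool isSolution(Problem_t p, Solution_t s)
{
    bool found = false;
    for (size_t i = 0; i < p.len; i++) {
        if (p.data[i] < s)
            return false;
        if (p.data[i] == s)
            found = true;
    }
    return found;
}
```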

Constants, Functions and Procedures: The following procedures and
functions must be defined for a Divide-and-Conquer algorithm according to
the specifications given:

Size() is a function that returns some metric value on the size of a
problem. Together with the constant BaseCaseSize this function is used
to determine whether a given problem is small enough to be solved
directly.

BaseCaseSize is a constant that defines the largest problem that can be
solved directly (using some other algorithm). Any problem whose size is
no larger than BaseCaseSize will be solved using that other algorithm.

BaseCaseSolution() is a function that uses some algorithm to solve
directly problems of size no larger than BaseCaseSize. The
formal specification of this function is as follows:

Solution_t BaseCaseSolution(Problem_t aProblem)
/* Precondition:  isWellFormed(aProblem) and
 *                Size(aProblem) <= BaseCaseSize
 * Postcondition: (Return value = aSolution) and
 *                isSolution(aProblem, aSolution)
 */

Split() is a function that divides a given (large) problem into K
subproblems of strictly smaller size, whose solutions when merged produce
the solution to the given problem. Formally,

void Split(Problem_t aProblem, Problem_t subProblems[],
           OtherInfo_t Other)
/* Precondition:  isWellFormed(aProblem) and
 *                Size(aProblem) > BaseCaseSize
 * Postcondition:
 *   (∀k : 0 <= k < K : Size(subProblems[k]) < Size(aProblem))
 *   and
 *   (∃Sols[] :
 *     (∀k : 0 <= k < K : isSolution(subProblems[k], Sols[k])) :
 *     (∀S : isSolution(aProblem, S) : S = Merge(Sols, Other)))
 */

Merge() is a function that merges a given set of subsolutions into the
solution to the bigger problem, where the subproblems to which the
given subsolutions are solutions were produced by the split of that
problem. Formally:

Solution_t Merge(Solution_t subSolutions[],
                 OtherInfo_t Other)
/* Precondition:  true
 * Postcondition: (Return value = aSolution) and
 *   (∃Probs[] :
 *     (∀k : 0 <= k < K :
 *       isSolution(Probs[k], subSolutions[k])) :
 *     (∀P : Split(P, Probs[], Other) :
 *       isSolution(P, aSolution)))
 */

2.3.1  Algorithmic  Parameters 

In the performance analysis we will use the following parameters of a
Divide-and-Conquer algorithm:

K is the maximum number of the subproblems returned by the Split()
procedure, and

S/b is the maximum size of the subproblems returned by Split(), where
S is the size of the original problem.

For example, for the Mergesort algorithm these parameters are K = 2 and
b = 2; for Binary search, K = 1 and b = 2.

2.4  Approaches  to  Parallel  Implementation 

The data flow structure of a Divide-and-Conquer algorithm is shown in
figure 2.1. It consists of a growing tree of Split() processes concatenated
with a shrinking tree of Merge() processes. The two main approaches to the
parallel implementation of a Divide-and-Conquer algorithm map this graph
differently in time and space.


Figure  2.1:  Data  flow  graph  of  a  Divide-and-Conquer  algorithm. 

Data Flow Approach. If the subproblems produced by the Split() function
are independent and can be solved separately, then one can parallelize
the solving of subproblems in the sequential algorithm (making a
parfor loop out of the for loop in the DnC() procedure in section 2.2).

Thus Split(), the solving of the subproblems, and Merge() are executed
one after another, while the K subproblems themselves are solved in
parallel. The general program structure of this approach is given below:

Solution_t Parallel_DnC1(Problem_t aProblem)
{
    Solution_t aSolution;

    if (Size(aProblem) <= BaseCaseSize)
        aSolution = BaseCaseSolution(aProblem);
    else {
        Problem_t subProblems[K];
        Solution_t subSolutions[K];
        OtherInfo_t Other;
        int i;

        Split(aProblem, subProblems, Other);
        /* the K subproblems are solved in parallel */
        parfor (i = 0; i < K; i++)
            subSolutions[i] = Parallel_DnC1(subProblems[i]);
        aSolution = Merge(subSolutions, Other);
    }
    return aSolution;
}

Note that the structure of this approach does not require any changes
to the user-defined sequential functions Size(), BaseCaseSolution(),
Split() and Merge().

Control Flow Approach. This approach is applicable to Divide-and-Conquer
algorithms for which a problem and/or a solution can be represented
as a collection of independent parts of the same type. All component
functions of the archetype then can be rewritten as separate processes
which use messages to communicate parts of subproblems/subsolutions
to each other, and which produce parts of subproblems/subsolutions
based only on partial information (several parts of a
subproblem/subsolution) received from the other processes. Then, after
the decision is made whether a problem is a base-case problem or not,
function Split2(), function Merge2() and the solving of the subproblems
can be executed concurrently. The skeleton for this implementation is as
follows:


Solution_t Parallel_DnC2(Problem_t aProblem)
{
    Solution_t aSolution;

    if (Size(aProblem) <= BaseCaseSize)
        aSolution = BaseCaseSolution2(aProblem);
    else {
        Problem_t subProblems[K];
        Solution_t subSolutions[K];
        OtherInfo_t Other;

        /* the three statements below run concurrently */
        par {
            Split2(aProblem, subProblems, Other);
            parfor (int i = 0; i < K; i++)
                subSolutions[i] = Parallel_DnC2(subProblems[i]);
            aSolution = Merge2(subSolutions, Other);
        }
    }
    return aSolution;
}


Note  that  the  declaration  of  subProblems  and  subSolutions  as  well  as 
the  parameters  to  the  functions  corresponds  more  to  the  declarations 
of  the  communication  channels  between  the  processes  rather  than  to 
the  variable  declarations.  Also  note  that  we  use  function  names  of 
the  form  FunctionName2()  instead  of  FunctionName( )  to  indicate  that 
these  functions  differ  from  those  used  in  the  sequential  algorithm. 


Even though the Control Flow approach allows for more concurrency,
the Data Flow approach is very easy to use to parallelize already existing
sequential code. Even development of a parallel algorithm from scratch is
easier with this approach, since the user can develop and debug the
sequential program first. The rest of this chapter focuses on the
implementation and performance analysis of the Data Flow approach.




2.5  Data  Flow  Approach  and  Performance 


Once the user has developed a sequential Divide-and-Conquer algorithm,
the Data Flow approach can easily be used to obtain a parallel algorithm
from the sequential code. A programming template can be provided to the
user to obtain a parallel algorithm by simple instantiation of the component
sequential functions. However, in order for the parallel implementation to
be efficient several issues have to be taken into consideration, such as
granularity and mapping of the processes.

To analyze and predict the performance of a parallel Divide-and-Conquer
algorithm, we need to know some performance characteristics of the target
architecture:

•  N_p: the number of the processors in the system.

•  T_{com}: average time for communication between the processors or
machines in the system. This time can be given as a function of the size
of the message or the size of a problem.

•  T_{split}: execution time of the sequential Split() procedure as a
function of the problem size.

•  T_{merge}: execution time of the Merge() procedure as a function of the
problem size.

•  T_{base-case}: execution time of BaseCaseSolution() as a function
of the problem size.

We  also  make  the  following  assumptions  about  the  target  architecture: 

•  available  processors  are  identical; 

•  only  asynchronous  communication  actions  are  used;  in  particular,  only 
non-blocking  sends  are  used; 

•  the  time  that  the  sender  of  a  message  spends  on  communication  actions 
is  negligibly  small. 

Assuming that the size of the original problem is S = b^q and the size of
a base-case problem is BaseCaseSize = 1, and using the above functions,
we can express the execution time of the sequential algorithm as follows:

    T^{seq}(b^q) = \sum_{i=0}^{q-1} K^i ( T_{split}(b^{q-i}) + T_{merge}(b^{q-i}) ) + K^q T_{base-case}    (2.1)
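
The sequential cost model can be evaluated numerically. The sketch below
reads equation (2.1) as a sum over the q levels of the split/merge tree,
with K^i nodes of size b^(q-i) on level i, plus K^q base cases. The function
names and the two example cost functions (unit_cost, linear_cost) are
hypothetical and serve only to make the example self-contained.

```c
#include <assert.h>

/* A cost model maps a problem size to an execution time. */
typedef double (*time_fn)(double size);

/* Example cost models (assumptions for illustration only). */
double unit_cost(double size)   { (void)size; return 1.0; }
double linear_cost(double size) { return size; }

/* Sequential time for a problem of size b^q with base-case size 1:
 * sum over levels i = 0..q-1 of K^i * (t_split + t_merge) at size b^(q-i),
 * plus K^q base cases costing t_base each. */
double t_seq(int K, int b, int q,
             time_fn t_split, time_fn t_merge, double t_base)
{
    assert(K >= 1 && b >= 2 && q >= 0);

    double size = 1.0;
    for (int j = 0; j < q; j++)
        size *= b;                 /* size = b^q */

    double nodes = 1.0;            /* K^i nodes on level i */
    double total = 0.0;
    for (int i = 0; i < q; i++) {
        total += nodes * (t_split(size) + t_merge(size));
        nodes *= K;
        size /= b;
    }
    return total + nodes * t_base; /* nodes == K^q here */
}
```

For a Mergesort-like model (K = 2, b = 2, unit-cost split, linear-cost
merge, unit-cost base case) and q = 3, the levels contribute 9 + 10 + 12 and
the eight base cases contribute 8, so t_seq(2, 2, 3, unit_cost, linear_cost,
1.0) evaluates to 39.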

2.5.1 Mapping

Let us consider a mapping of the data-flow graph of a Divide-and-Conquer
algorithm (e.g., see figure 2.1) onto the processors of the system.


Figure 2.2: A mapping of a Divide-and-Conquer data flow graph (from the
Initial Problem to the Final Solution; grouped nodes are mapped onto the
same processor).


Note that the execution of a process at any level of the data-flow graph
does not start until all preceding processes (predecessors in the graph) have
terminated. If the sizes of the subproblems returned by the Split()
functions are approximately equal, then the execution times of all processes
on one level of the graph are equal. Then, suppose that we map the processes
onto the system as shown in figure 2.2. If the communication overhead is
relatively small, such a mapping will produce an efficient implementation.

From the execution time functions for the BaseCaseSolution(), Split()
and Merge() procedures, we can derive a recursive expression for the
execution time of the parallel algorithm with this method:

    T^{par}(S) = T_{split}(S) + T^{par}(S/b) + T_{merge}(S) + 2 T_{com}(S)




Suppose that the size of the original problem is S = b^q and problems of size
S = b^c are solved sequentially; then

    T^{par}(b^q) = \sum_{i=c+1}^{q} ( T_{split}(b^i) + T_{merge}(b^i) + 2 T_{com}(b^i) ) + T^{seq}(b^c)    (2.2)
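
The parallel cost model can be evaluated in the same way. The sketch below
unrolls the recursion T_par(S) = T_split(S) + T_par(S/b) + T_merge(S) +
2 T_com(S) from size b^q down to size b^c, where the remaining work is the
sequential time T_seq(b^c), passed in here as a precomputed number. All
function names and the three example cost functions are hypothetical.

```c
#include <assert.h>

/* A cost model maps a problem size to an execution time. */
typedef double (*time_fn)(double size);

/* Example cost models (assumptions for illustration only). */
double unit_cost(double size)   { (void)size; return 1.0; }
double linear_cost(double size) { return size; }
double half_cost(double size)   { (void)size; return 0.5; }

/* Parallel time for a problem of size b^q when problems of size b^c are
 * solved sequentially in time t_seq_bc: one split, one merge and two
 * communications on every level i = c+1 .. q. */
double t_par(int b, int q, int c,
             time_fn t_split, time_fn t_merge, time_fn t_com,
             double t_seq_bc)
{
    assert(b >= 2 && c >= 0 && c <= q);

    double size = 1.0;
    for (int j = 0; j < c; j++)
        size *= b;                 /* size = b^c */

    double total = t_seq_bc;
    for (int i = c + 1; i <= q; i++) {
        size *= b;                 /* size = b^i */
        total += t_split(size) + t_merge(size) + 2.0 * t_com(size);
    }
    return total;
}
```

With b = 2, q = 3, c = 1, unit-cost split, linear-cost merge, a constant
communication time of 0.5 and T_seq(2) = 5, the two parallel levels cost
6 and 10, giving a total of 21.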

2.5.2  Granularity 

Theoretically, the execution time of a “good” parallel program should
decrease as the number of processors available for computation increases. The
actual speedup of a program, however, is limited by the communication
speed. Efficiency of a parallel algorithm is not determined solely by whether
the algorithm uses all available resources or whether it uses them as soon as
possible.

Parallel  Base-Case  Size 

Communication overhead will make the parallel Divide-and-Conquer
algorithm for problems of a certain size less efficient than the sequential
algorithm. Let ParBaseCaseSize be the largest size of a problem such that the
sequential algorithm is more efficient for its solution than the parallel one.
Knowing the performance characteristics of the system, one can predict the
value of ParBaseCaseSize.

Infinite number of processors. Suppose that the target architecture
consists of an infinite number of identical processors (N_p = \infty). Suppose
that each process is mapped onto a separate processor as described in
section 2.5.1. Let T_n^{par} denote the execution time of the parallel
implementation in which subproblems in the first n levels of the data flow
graph are solved using the parallel algorithm, and subproblems on the lower
levels are solved using the sequential algorithm. For example, T_1^{par}
corresponds to the implementation in which the original problem is split
into subproblems, which are then solved using the sequential algorithm on
separate processors, and their subsolutions are then merged. T_n^{par} can
be expressed as:

    T_n^{par}(S) = \sum_{i=0}^{n-1} ( T_{split}(S/b^i) + T_{merge}(S/b^i) + 2 T_{com}(S/b^i) ) + T^{seq}(S/b^n)    (2.3)

Suppose that there exists a problem size S such that the sequential
solution of problems of size S is more efficient than their parallel solution
with one level of the graph executed in parallel:

    T^{seq}(S) < T_1^{par}(S)

Using equations (2.3) and (2.1) we can find a condition on the size S in terms
of execution time of the sequential algorithm and communication overhead.
Since T^{seq}(S) = T_{split}(S) + T_{merge}(S) + K T^{seq}(S/b), the split
and merge terms cancel on both sides, leaving:

    T^{seq}(S/b) < \frac{2}{K-1} T_{com}(S)    (2.4)

For non-decreasing execution time functions T_{split} and T_{merge} and a
constant function T_{com}(S), it is possible to show that inequality (2.4)
implies:

    \forall n = 1, 2, \ldots : T_n^{par}(S) > T^{seq}(S)

Thus, if inequality (2.4) holds, then solving a problem of size S with
any number of parallel steps is less efficient than its sequential solution.
Even though in most distributed systems communication time, T_{com}, is not
a constant, its dependence on the size of the message is usually relatively
small (in comparison with functions T_{split} and T_{merge}). Therefore, for
such systems the inequality (2.4) can be used as a good upper bound on the
value of ParBaseCaseSize.
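
Given concrete cost models, the threshold can also be located by comparing
the two execution times directly. The sketch below assumes that the
one-level parallel time is T_split(S) + T_merge(S) + 2 T_com(S) +
T_seq(S/b), following the recursion in section 2.5.1, and that problem sizes
are powers of b with base-case size 1. It returns the largest exponent q (up
to a caller-chosen limit) for which the sequential solution of a problem of
size b^q is faster. All names and cost functions are hypothetical.

```c
#include <assert.h>

typedef double (*time_fn)(double size);

/* Example cost models (assumptions for illustration only). */
double unit_cost(double size)   { (void)size; return 1.0; }
double linear_cost(double size) { return size; }
double com_cost(double size)    { (void)size; return 10.0; }

/* Sequential time for size b^q: K^i split/merge nodes on level i,
 * plus K^q base cases of cost t_base. */
static double t_seq(int K, int b, int q,
                    time_fn t_split, time_fn t_merge, double t_base)
{
    double size = 1.0, nodes = 1.0, total = 0.0;
    for (int j = 0; j < q; j++)
        size *= b;
    for (int i = 0; i < q; i++) {
        total += nodes * (t_split(size) + t_merge(size));
        nodes *= K;
        size /= b;
    }
    return total + nodes * t_base;
}

/* Largest q in [1, max_q] such that T_seq(b^q) < T_1_par(b^q);
 * returns 0 if one parallel level already pays off for every such q. */
int par_base_case_exponent(int K, int b, int max_q,
                           time_fn t_split, time_fn t_merge, time_fn t_com,
                           double t_base)
{
    assert(K >= 1 && b >= 2 && max_q >= 1);

    int best = 0;
    double size = 1.0;
    for (int q = 1; q <= max_q; q++) {
        size *= b;                                   /* size = b^q */
        double seq  = t_seq(K, b, q, t_split, t_merge, t_base);
        double par1 = t_split(size) + t_merge(size) + 2.0 * t_com(size)
                    + t_seq(K, b, q - 1, t_split, t_merge, t_base);
        if (seq < par1)
            best = q;
    }
    return best;
}
```

For K = 2, b = 2, unit-cost split, linear-cost merge, unit-cost base case
and a constant communication time of 10, the sequential algorithm wins up to
size 2^3 = 8 (39 versus 44) and loses from size 16 onward (95 versus 76).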

Finite number of processors. If the system consists of a finite number
of processors of the same type, it imposes more restrictions on the value of
ParBaseCaseSize. Let us find an upper bound on ParBaseCaseSize for
the system in which the number of processors (N_p) is a power of K, i.e.,
N_p = K^p for some p.

If the mapping strategy discussed in section 2.5.1 is used, then until
the execution of the p-th level of the tree, there will be only one process
per processor. If the size of the original problem is b^q, then at level p
each processor will be solving a subproblem of size S' = b^{q-p}. If the
problems of size S' are then solved in parallel (with workload distributed
equally between the processors), then each processor will have to execute
the same total number of sequential procedures as if it were solving one
problem of size S' sequentially. However, in addition to communication
overhead, some system overhead will be added due to context switching. Thus,
it is more efficient to solve problems of size S' sequentially than by using
the parallel algorithm.

Therefore, for a distributed system with some finite number N_p of
identical processors, the parallel base-case size for the original problem
of size S is determined by the performance parameters of the system and by
the number of processors:

    T^{seq}(s) < \frac{2}{K-1} T_{com}(bs)    (2.5)

    ParBaseCaseSize = \max\{ bs, S / b^p \}

where s is the largest subproblem size that satisfies (2.5) and N_p = K^p.


Optimal  Depth  of  Data  Flow  Subtree 

Because of the strategy behind the Divide-and-Conquer Archetype, the split
of several levels of subproblems or the merge of several levels of subsolutions
can be easily combined in one sequential process (see figure 2.3). Let us
call such processes Group_Split and Group_Merge. By changing the depth
of the data-flow subtree executed by one process, the user can control the
granularity of a parallel Divide-and-Conquer algorithm.

If t(s) is the execution time of one node in the graph (either the Split()
or the Merge() procedure) on a problem of size s, then the execution time
of the process that executes d levels of the tree is

    T_{group}(t, d, S) = \sum_{i=0}^{d-1} K^i t(S / b^i)
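
The cost of a group process can be computed with a short loop. The sketch
below assumes that a group process executes, one after another, all K^i
nodes on each of its d levels, and that a node on level i works on a problem
of size S / b^i. The function name t_group and the example cost functions
are hypothetical.

```c
#include <assert.h>

typedef double (*time_fn)(double size);

/* Example cost models (assumptions for illustration only). */
double unit_cost(double size)   { (void)size; return 1.0; }
double linear_cost(double size) { return size; }

/* Time of one sequential process that executes d levels of the tree,
 * starting from a problem of size S: sum over i = 0..d-1 of
 * K^i * t(S / b^i). */
double t_group(time_fn t, int K, int b, int d, double S)
{
    assert(K >= 1 && b >= 2 && d >= 0);

    double nodes = 1.0;   /* K^i nodes on level i         */
    double size = S;      /* each of size S / b^i         */
    double total = 0.0;
    for (int i = 0; i < d; i++) {
        total += nodes * t(size);
        nodes *= K;
        size /= b;
    }
    return total;
}
```

For example, with K = 2, b = 2 and a linear node cost, a group of depth 2
starting at size 8 costs 8 + 2 * 4 = 16, while a unit-cost node gives
1 + 2 + 4 = 7 for depth 3.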


The optimal depth of the data-flow subtree, the one that achieves the best
performance, can be found by solving a minimization problem. Suppose that
the optimal depth is equal to some constant D. Then, using the execution
times of Split(), Merge() and BaseCaseSolution(), we derive an expression
T(D, S) for the execution time of such an algorithm with depth D for an
original problem of size S, and find the value of D by solving the
minimization problem:

    \min_d T(d, S)

where T(d, S) is:

    T(d, S) = \sum_{i=0}^{d_1 - 1} ( T_{group}(T_{split}, d, S/b^{id}) + T_{group}(T_{merge}, d, S/b^{id}) )
              + T_{group}(T_{split}, d_2, S/b^{d_1 d}) + T_{group}(T_{merge}, d_2, S/b^{d_1 d})
              + T^{seq}(ParBaseCaseSize)

where N_p = d_1 d + d_2, and d_2 < d.

Figure 2.4 illustrates how the depth of the subtree can affect the
execution time of the parallel algorithm. The graph shows the execution
time of two implementations of finding the minimum element in an array:
the upper curve corresponds to an implementation with D = 1, and the
lower curve corresponds to an implementation with D computed using the
suggested approach.

2.5.3  Skeleton 

The Data Flow approach with mapping described in section 2.5.1 has the
structure shown below. Constant Depth denotes the depth of the data-flow
subtree mapped onto one process. Procedures Group_Split() and
Group_Merge() denote the group processes discussed in the previous section.

Figure 2.4: Execution time of finding the minimum element.

Solution_t Parallel_DnC1(Problem_t aProblem)
{
    /* K^Depth and 2^(Depth-1) below denote powers, not C's XOR operator */
    Solution_t aSolution;

    if (Size(aProblem) <= ParBaseCaseSize)
        /* Problem is too small to be solved in parallel,
         * solve it sequentially */
        aSolution = DnC(aProblem);
    else {
        /* solve problem in parallel */
        Problem_t subProblems[K^Depth];
        Solution_t subSolutions[K^Depth];
        OtherInfo_t Other[2^(Depth-1)];
        int i;

        Group_Split(aProblem, subProblems, Other);
        parfor (i = 0; i < K^Depth; i++)
            subSolutions[i] = Parallel_DnC1(subProblems[i]);
        aSolution = Group_Merge(subSolutions, Other);
    }
    return aSolution;
}

2.6  Implementations 

We used the approach presented in section 2.5 to implement a software
template in C with PVM, C with NX, and CC++. Since the code for all
templates is very similar, we will present and discuss the source code for the
CC++ implementation only.

A  program  is  provided  to  the  user  to  determine  the  optimal  values  of 
constants  Depth  and  BaseCaseSize.  The  user  is  required  to  provide  the 
parameters  of  the  algorithm  and  the  target  system  discussed  in  sections 
2.3.1  and  2.5. 

The  user  is  required  to  define  two  classes  Problem_t  and  Solution_t 
with  the  following  public  interface: 

class Problem_t {
public:
    Problem_t();
    ~Problem_t();
    int Size();
    void Split(Problem_t *);

    // data transfer functions
    friend CCVoid& operator<<(CCVoid&, const Problem_t&);
    friend CCVoid& operator>>(CCVoid&, Problem_t&);
};

class Solution_t {
public:
    Solution_t() {};
    ~Solution_t();
    void BaseCaseSolution(Problem_t&);
    void Merge(Solution_t *);

    // data transfer functions
    friend CCVoid& operator<<(CCVoid&, const Solution_t&);
    friend CCVoid& operator>>(CCVoid&, Solution_t&);
};

The last two functions in each class are data transfer functions required
by CC++ [12]. They define how the data of an object of the class should
be transferred from one processor object to another. These functions are
invoked when the objects of a class are used as arguments to procedures
invoked on a remote processor object.

The CC++ implementation defines the processor object type DnC_t, which
uses two classes, Machines_t and Tree_t. These classes are provided to the
user.

Class Machines_t: is used for representing and manipulating arrays of the
names of the computers (processors) used in the computation. This class
has the following public interface:

typedef char *string;

class Machines_t {
public:
  int number;
  int *length;
  string *names;

  Machines_t();
  Machines_t(int n);
  Machines_t(int n, char **nl);
  ~Machines_t();

  void Part(int number_of_parts, int which_part, Machines_t& m);
  friend CCVoid& operator<<(CCVoid&, const Machines_t&);
  friend CCVoid& operator>>(CCVoid&, Machines_t&);
};

Member function Part splits the array of machine names into the given
number of subarrays and stores the subarray with the given part number in
the array of machines m.
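The text does not say how Part distributes the names among the parts. As a minimal sketch, assuming contiguous parts whose sizes differ by at most one, the index range of one part could be computed as follows. The type part_range_t and the function part_range are hypothetical names introduced only for this illustration; they are not part of the Machines_t interface.

```c
/* Hypothetical helper type: half-open index range [begin, end) of one part. */
typedef struct {
    int begin;
    int end;
} part_range_t;

/* Assumed policy: split n names into number_of_parts contiguous parts
 * whose sizes differ by at most one, and return the range of part
 * which_part (0-based).
 * Requires number_of_parts > 0 and 0 <= which_part < number_of_parts. */
part_range_t part_range(int n, int number_of_parts, int which_part)
{
    int base  = n / number_of_parts;  /* minimum size of every part        */
    int extra = n % number_of_parts;  /* the first `extra` parts get one more */
    part_range_t r;

    r.begin = which_part * base + (which_part < extra ? which_part : extra);
    r.end   = r.begin + base + (which_part < extra ? 1 : 0);
    return r;
}
```

With 10 machine names and 3 parts, this policy yields the ranges [0, 4), [4, 7) and [7, 10), so every part receives at least one machine whenever the number of parts does not exceed the number of names.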

Class Tree_t: is used by procedures Group_Split() and Group_Merge() for
storing and manipulating the tree of subproblems and subsolutions mapped
onto one processor object.

typedef struct tree_level_t {
  int size;
  // the number of nodes on the level
  Problem_t *subproblems;
  // array of the subproblems of the level
  Solution_t *subsolutions;
  // array of the subsolutions of the level
  // Pairs subproblems[i] and subsolutions[i] form
  // the nodes of the level
  int *child_index;
  // array of indices of the nodes of the lower level
  // connected to the nodes of this level
} tree_level_t;

class Tree_t {
private:
  tree_level_t *Levels;
  // array of the pointers to the levels of the tree
  int NumberOfLevels;
  // number of levels in the tree
  int NumberOfChildren;
  // maximum number of children for a node

public:
  Tree_t(int levels = 1, int numberofchildren = 2);
  ~Tree_t();

  tree_level_t *Level(int number);
  tree_level_t *LastLevel();
  void CreateNewLevel(int level);
};


Processor Object Type DnC_t
Class DnC_t has the following interface:

global class DnC_t {
private:
  Machines_t Processors;
  int BaseCaseSize;
  int ParBaseCaseSize;
  int Depth;
  int Number_Of_Children;

  void Group_Split(Problem_t Problem, Tree_t *Tree);
  Solution_t Group_Merge(Tree_t *Tree);

public:
  DnC_t(int basecasesize, int par_basecasesize, int splitdepth,
        int numberofchildren, Machines_t processors);

  Solution_t DnC(Problem_t problem);
  Solution_t Par_DnC(Problem_t problem);
};


Selected  Member  Functions 

Function Par_DnC() follows the skeleton presented in section 2.5.3 very closely
and does not require any additional explanation.

Solution_t DnC_t::Par_DnC(Problem_t aProblem)
{
  if ((Processors.number == 1) ||
      (aProblem.Size() < BaseCaseSize))
    return Divide_and_Conquer(aProblem);
  else {
    Solution_t aSolution;
    Tree_t *Tree = new Tree_t(SplitDepth+1,
                              Number_Of_Children);
    tree_level_t *last;
    DnC_t *global Children[Processors.number];
    int number_of_subproblems, part, k, i;

    // split the problem into at most Processors.number
    // subproblems of size at most ParBaseCaseSize
    Group_Split(aProblem, Tree);
    last = Tree->LastLevel(SplitDepth);
    number_of_subproblems = last->size;
    assert(number_of_subproblems <= Processors.number);

    // each subproblem will be solved on >= 1 processors
    part = Processors.number / number_of_subproblems;
    parfor (int k = 0; k < number_of_subproblems; k++) {
      Machines_t *M = new Machines_t();
      Processors.Part(number_of_subproblems, k, *M);
      {
        proc_t placement = proc_t("DnC.out", M->names[0]);
        Children[k*part] = new (placement)
            DnC_t(BaseCaseSize, ParBaseCaseSize,
                  SplitDepth, Number_Of_Children, *M);
        last->subsolutions[k] =
            Children[k*part]->Par_DnC(last->subproblems[k]);
        delete Children[k*part];
      }
      delete M;
    }
    aSolution = Group_Merge(Tree);
    return aSolution;
  }
}

2.7  Applications 

2.7.1  Mergesort 

Problem  Description:  Given  an  array  of  N  integers,  sort  the  integers  in 
ascending  order. 

Components: 

Data types: Data types Problem_t and Solution_t for this problem are
arrays of integers, and data type OtherInfo_t is undefined:

typedef struct {
  int size;     /* number of elements in the array */
  int *values;  /* the array of elements */
} Problem_t, Solution_t;
Predicate IsSolution

$$\mathit{isSolution}(P, S) \equiv (P.\mathit{size} = S.\mathit{size}) \land (S.\mathit{values} \text{ is some permutation of } P.\mathit{values}) \land (\forall i : 0 < i < n : S.\mathit{values}[i-1] \le S.\mathit{values}[i])$$

where n = S.size.

BaseCaseSize: BaseCaseSize = 1
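The isSolution predicate for mergesort can also be turned into an executable check, which is handy when testing a template instance. The following is a minimal sketch, not part of the archetype: the type Array_t mirrors the Problem_t / Solution_t layout, and the names Array_t, cmp_int and is_solution are assumptions made for this illustration. It relies on the fact that a sorted array S is a permutation of P exactly when S equals a sorted copy of P.

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the Problem_t / Solution_t layout (hypothetical name). */
typedef struct {
    int  size;
    int *values;
} Array_t;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if S is an ascending permutation of P, 0 otherwise
 * (also 0 if the temporary buffer cannot be allocated). */
int is_solution(Array_t P, Array_t S)
{
    int i, ok;
    int *sorted;

    if (P.size != S.size)
        return 0;
    for (i = 1; i < S.size; i++)          /* S must be in ascending order */
        if (S.values[i - 1] > S.values[i])
            return 0;
    if (P.size == 0)
        return 1;

    /* S is sorted, so it is a permutation of P iff it equals sorted(P). */
    sorted = malloc(P.size * sizeof *sorted);
    if (sorted == NULL)
        return 0;
    memcpy(sorted, P.values, P.size * sizeof *sorted);
    qsort(sorted, P.size, sizeof *sorted, cmp_int);
    ok = (memcmp(sorted, S.values, P.size * sizeof *sorted) == 0);
    free(sorted);
    return ok;
}
```

The check costs one extra sort of the input, so it is meant for tests and debugging rather than for production runs.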
Functions  and  Procedures 


int Size(Problem_t P)
{
  return P.size;
}

Solution_t BaseCaseSolution(Problem_t aProblem)
{
  return aProblem;
}

void Split(Problem_t aProblem,
           Problem_t subProblems[],
           OtherInfo_t Other)
{
  subProblems[0].size = aProblem.size/2;
  subProblems[1].size = aProblem.size - subProblems[0].size;
  subProblems[0].values = aProblem.values;
  subProblems[1].values = aProblem.values
                          + subProblems[0].size;
}


Solution_t Merge(Solution_t subSolutions[],
                 OtherInfo_t Other)
{
  int i, j;
  Solution_t Solution;

  Solution.size = subSolutions[0].size +
                  subSolutions[1].size;
  /* the subsolutions may alias the input array,
   * so the result is merged into a new buffer */
  Solution.values = (int *) malloc(Solution.size * sizeof(int));
  i = 0; j = 0;

  while ((i < subSolutions[0].size) &&
         (j < subSolutions[1].size))
    if ((subSolutions[0].values[i]) <
        (subSolutions[1].values[j])) {
      Solution.values[i+j] = subSolutions[0].values[i];
      i++;
    }
    else {
      Solution.values[i+j] = subSolutions[1].values[j];
      j++;
    }

  /* copy whatever is left in either subsolution */
  for (; i < subSolutions[0].size; i++)
    Solution.values[i+j] = subSolutions[0].values[i];
  for (; j < subSolutions[1].size; j++)
    Solution.values[i+j] = subSolutions[1].values[j];
  return Solution;
}
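To show how the components fit together, here is a minimal, self-contained sketch of a sequential divide-and-conquer driver specialized to mergesort. It assumes the usual recursion (solve the base case directly, otherwise split, recurse on both halves, and merge); the names Array_t, merge_sorted and dnc_mergesort are hypothetical and do not come from the templates described in this chapter.

```c
#include <stdlib.h>

/* Mirrors the Problem_t / Solution_t layout (hypothetical name). */
typedef struct {
    int  size;
    int *values;
} Array_t;

/* Merge two sorted arrays into a newly allocated sorted array. */
static Array_t merge_sorted(Array_t a, Array_t b)
{
    Array_t out;
    int i = 0, j = 0;

    out.size = a.size + b.size;
    out.values = malloc((out.size > 0 ? out.size : 1) * sizeof *out.values);
    if (out.values == NULL)
        abort();
    while (i < a.size && j < b.size) {
        if (a.values[i] <= b.values[j]) {
            out.values[i + j] = a.values[i];
            i++;
        } else {
            out.values[i + j] = b.values[j];
            j++;
        }
    }
    for (; i < a.size; i++) out.values[i + j] = a.values[i];
    for (; j < b.size; j++) out.values[i + j] = b.values[j];
    return out;
}

/* Returns a newly allocated sorted copy of p; the caller frees .values. */
Array_t dnc_mergesort(Array_t p)
{
    Array_t sub[2], sol[2], out;

    if (p.size <= 1) {                    /* base case: BaseCaseSize = 1 */
        out.size = p.size;
        out.values = malloc(sizeof *out.values);
        if (out.values == NULL)
            abort();
        if (p.size == 1)
            out.values[0] = p.values[0];
        return out;
    }

    /* Split: two halves that share the input array. */
    sub[0].size   = p.size / 2;
    sub[0].values = p.values;
    sub[1].size   = p.size - sub[0].size;
    sub[1].values = p.values + sub[0].size;

    sol[0] = dnc_mergesort(sub[0]);
    sol[1] = dnc_mergesort(sub[1]);

    out = merge_sorted(sol[0], sol[1]);   /* Merge */
    free(sol[0].values);
    free(sol[1].values);
    return out;
}
```

Unlike BaseCaseSolution() above, this sketch copies the base case into fresh memory, so every subsolution owns its buffer and can be freed uniformly after the merge.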

Parameters: function Split() divides the given array into two arrays of
approximately equal size; thus, the algorithmic parameters of the
mergesort are K = 2 and b = 2. Using linear regression it is possible to find
execution time functions for procedures BaseCaseSolution(), Split() and
Merge(). On the Touchstone Delta these functions (in milliseconds) and the
communication time function are:

$$T_{base\text{-}case}(n) = 0.0074$$
$$T_{split}(n) = 0.0006$$
$$T_{merge}(n) = 0.00075n + 0.00696$$
$$T_{com}(n) = 0.00116n + 0.24757$$

where n is the size of the array.

Performance results: Using the above parameters we can compute the
optimal depth and the value of ParBaseCaseSize. The value of ParBaseCaseSize
is 128.

Table 2.1 summarizes the execution time of the algorithm on the Touchstone
Delta for different values of the depth. Execution time is given in seconds.
The numbers that correspond to the depth chosen by the performance model
as optimal are shown in emphasis font. The discrepancy between the predicted
optimal depth and the actual optimal depth can be explained by the fact that
the global clock was used in all performance measurements (including the
ones used for computing T_split, T_com, etc.) and by the fact that the
performance model does not take into account such parameters as contention in
the network, memory caching, and the overhead of recursive calls.
Number of processors   d = 1   d = 2   d = 3   d = 4   d = 5
                   2   10.4    -       -       -       -
                   4   5.96    6.02    -       -       -
                   8   3.79    3.90    4.13    -       -
                  16   2.71    2.82    3.06    3.41    -
                  32   2.20    2.31    2.55    2.89    3.28
                  64   1.94    2.05    2.35    2.65    3.03

Table 2.1: Execution time of the mergesort algorithm on Touchstone Delta
(in seconds).


The graph of speedup of the parallel mergesort with respect to the
sequential algorithm is shown in figure 2.5. The parallel mergesort was
implemented using the depth predicted by the model.
Chapter 3

The  Branch  and  Bound 
Archetype 

3.1  Introduction 

Branch  and  Bound  is  a  technique  for  searching  an  implicit  directed  graph 
which  is  usually  acyclic  or  even  a  tree  [2]. 

The Branch and Bound approach is often used for finding an optimal
solution to some problem specified by a finite but possibly very large space of
solutions. The search graph for such a problem consists of nodes
corresponding to a partition of the solution space, with successive nodes representing
smaller and smaller subpartitions of preceding nodes. For each node a bound
on the possible value of any solution within the partition of this node is
calculated. Usually, this bound is used to prune certain branches of a search
tree if a better solution has already been found. Sometimes, a depth-first
search or a breadth-first search strategy is used. More often, however, the
calculated bound is also used to choose which of the open nodes of the tree
should be explored first.

Because of the unstructured search strategy, a simple parallel
implementation of a Branch and Bound algorithm often does not scale very well:
processes exploring some branches of the tree do not have current
information about the best solution found so far. Performance can be
improved if the parallel Branch and Bound implementation
is composed with some heuristic algorithm for solving the problem. The
suboptimal solutions found by the heuristic algorithm can then be used
together with the calculated bounds for pruning certain branches of the search
tree.


3.1.1  Assumptions 

For the purposes of this report we make the following assumptions:

•  the problem being solved is a maximization problem that can be
specified by a problem type, an objective function being maximized, and a
number of constraints,

•  the bound of a partition is a single real number.

If these assumptions do not hold, the overall design approach described in
this chapter is still valid; only a small number of details have to be changed.
Moreover, since any minimization problem can be easily converted into a
maximization problem (by changing the sign of the objective function), the
above assumptions hold for most problems that can be solved using the
Branch and Bound approach.
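The sign-change argument can be written out in one line. The symbols f (an objective function to be minimized) and F (a finite, non-empty set of feasible solutions) are introduced here only for illustration:

```latex
\min_{s \in F} f(s) \;=\; -\max_{s \in F} \bigl(-f(s)\bigr),
\qquad
\operatorname*{arg\,min}_{s \in F} f(s) \;=\; \operatorname*{arg\,max}_{s \in F} \bigl(-f(s)\bigr)
```

A solution that maximizes -f is therefore also a solution that minimizes f, and an upper bound on -f over a partition plays the role that a lower bound on f would play in the minimization form.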

3.2  Archetype  Skeleton 

It has been suggested in many books on computer algorithms (e.g., [2])
that the set of open partitions (partitions yet to be expanded) should
be stored in a heap. In the program that follows we use data type Heap_t,
with the following interface:

void AddPartition(Heap_t, Partition_t);
int Empty(Heap_t);
Partition_t RemoveBestPartition(Heap_t);
void RemoveWorseThan(Heap_t, Partition_t);

Using  this  data  type  the  skeleton  of  the  Branch  and  Bound  Archetype 
is  as  follows: 

Partition_t BnB(Partition_t OriginalPartition)
{
  Heap_t UnexpandedPartitions;
  Partition_t SubPartitions[K];
  Partition_t BestSolution;
  Partition_t aPartition;
  int n, i;

  AddPartition(UnexpandedPartitions, OriginalPartition);
  while (!Empty(UnexpandedPartitions)) {
    aPartition = RemoveBestPartition(UnexpandedPartitions);
    n = Branch(aPartition, SubPartitions);

    for (i = 0; i < n; i++) {
      if (isSolution(SubPartitions[i]) &&
          (Bound(SubPartitions[i]) > Bound(BestSolution))) {
        BestSolution = SubPartitions[i];
        RemoveWorseThan(UnexpandedPartitions, BestSolution);
      }
      else
        if (Bound(SubPartitions[i]) > Bound(BestSolution))
          AddPartition(UnexpandedPartitions,
                       SubPartitions[i]);
    }
  }
  return BestSolution;
}

Constant K is the maximum number of subpartitions returned by the
function Branch().
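The Heap_t interface used by the skeleton can be realized in many ways. The following is a minimal sketch of one of them, an array-based binary max-heap ordered by the bound. Several things here are assumptions made for illustration only: the heap is passed by pointer (the interface above passes it by value), each partition carries a cached bound field, the id field stands in for problem-specific data, and AddPartition returns a status code.

```c
#include <stdlib.h>

/* Hypothetical partition record: only the cached bound matters to the heap. */
typedef struct {
    float bound;  /* value of Bound() for this partition (assumed cached) */
    int   id;     /* stand-in for the problem-specific data */
} Partition_t;

typedef struct {
    Partition_t *items;
    int size;
    int capacity;
} Heap_t;          /* initialize with: Heap_t h = {0}; */

static void swap_items(Partition_t *a, Partition_t *b)
{
    Partition_t t = *a; *a = *b; *b = t;
}

/* Restore the max-heap property downward from index i. */
static void sift_down(Heap_t *h, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->size && h->items[l].bound > h->items[m].bound) m = l;
        if (r < h->size && h->items[r].bound > h->items[m].bound) m = r;
        if (m == i) return;
        swap_items(&h->items[i], &h->items[m]);
        i = m;
    }
}

int Empty(const Heap_t *h) { return h->size == 0; }

/* Returns 0 on success, -1 if memory could not be allocated. */
int AddPartition(Heap_t *h, Partition_t p)
{
    int i;
    if (h->size == h->capacity) {
        int cap = h->capacity ? 2 * h->capacity : 8;
        Partition_t *q = realloc(h->items, cap * sizeof *q);
        if (q == NULL) return -1;
        h->items = q;
        h->capacity = cap;
    }
    i = h->size++;
    h->items[i] = p;
    while (i > 0 && h->items[(i - 1) / 2].bound < h->items[i].bound) {
        swap_items(&h->items[(i - 1) / 2], &h->items[i]);  /* sift up */
        i = (i - 1) / 2;
    }
    return 0;
}

/* Precondition: the heap is not empty. */
Partition_t RemoveBestPartition(Heap_t *h)
{
    Partition_t best = h->items[0];
    h->items[0] = h->items[--h->size];
    sift_down(h, 0);
    return best;
}

/* Drop every partition whose bound does not exceed the bound of `best`,
 * then rebuild the heap bottom-up. */
void RemoveWorseThan(Heap_t *h, Partition_t best)
{
    int i, kept = 0;
    for (i = 0; i < h->size; i++)
        if (h->items[i].bound > best.bound)
            h->items[kept++] = h->items[i];
    h->size = kept;
    for (i = kept / 2 - 1; i >= 0; i--)
        sift_down(h, i);
}
```

In this sketch RemoveWorseThan() takes time linear in the heap size. That matches the skeleton, which calls it only when a better solution is found, but a different structure may be preferable if pruning is very frequent.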


3.3  Archetype  Components 

In  order  to  develop  a  Branch  and  Bound  algorithm,  the  user  has  to  define 
one  data  type  and  several  functions  on  that  data  type,  as  described  below: 

Data type: Partition_t is a user-defined data type representing a
partition in the space of feasible solutions, i.e., a non-empty set of feasible
solutions.

Predicates and Ghost Functions: For the purposes of specification and
reasoning, some predicates and ghost functions should be defined as follows:

Value(S) is a number-valued function that returns the value of the objective
function for a given feasible solution S.

Set(P) is a function whose value is the set of all feasible solutions for
partition P.

Metric(P) is a number-valued function whose value corresponds to some
metric of the partition, such that Metric(P) = 0 implies that partition
P contains only one feasible solution, that is, |Set(P)| = 1. Functions
Set(P) and Metric(P) are related to each other as follows:

$$(Set(P_1) \subseteq Set(P_2)) \Rightarrow (Metric(P_1) \le Metric(P_2))$$
and
$$(Set(P_1) \subset Set(P_2)) \Rightarrow (Metric(P_1) < Metric(P_2))$$

isWellFormed(P) is a predicate that holds if and only if the given partition
P is a well-formed partition. This predicate is required because in
programming languages variables of certain types are allowed to
take many values, some of which do not correspond to a valid partition
within the context of the problem being solved.

Procedures  and  Functions: 

Branch() function divides a given partition into 1 or more subpartitions
with strictly smaller metric value. Formally, the Branch() function is
defined as follows:

int Branch(Partition_t aPartition,
           Partition_t subPartitions[])
/* Precondition:  isWellFormed(aPartition)
 * Postcondition: (Return value = n) and (n >= 1) and
 *   (forall i : 0 <= i < n :
 *      Metric(subPartitions[i]) < Metric(aPartition))
 *   and (Set(aPartition) =
 *          union of Set(subPartitions[i]) for i = 0..n-1)
 */

Note that this specification can be weakened if we use the set of all
feasible solutions to the original problem being solved, V. Then the
specification of function Branch() is as follows:

int Branch(Partition_t aPartition,
           Partition_t subPartitions[])
/* Precondition:  isWellFormed(aPartition)
 * Postcondition: (Return value = n) and (n >= 1) and
 *   (forall i : 0 <= i < n :
 *      Metric(subPartitions[i]) < Metric(aPartition))
 *   and (Set(aPartition) is a subset of
 *          the union of Set(subPartitions[i]) for i = 0..n-1)
 *   and (forall i : 0 <= i < n :
 *      Set(subPartitions[i]) is a subset of Set(V))
 */

isSolution() is a function that tests whether a given partition contains
exactly one feasible solution.

int isSolution(Partition_t aPartition)
/* Precondition:  isWellFormed(aPartition)
 * Postcondition:
 *   ((Return value = 0) and (|Set(aPartition)| > 1))
 *   or ((Return value = 1) and (|Set(aPartition)| = 1))
 */

Bound() is a user-defined function that evaluates the bound on all feasible
solutions within a given partition. If a partition contains only one
solution (or equivalently, function isSolution() returns 1 for a given
partition), then function Bound() returns the true solution value for
the only solution in the partition; otherwise, Bound() returns some
upper bound on all possible solutions within a given partition.

float Bound(Partition_t aPartition)
/* Precondition:  isWellFormed(aPartition) and
 *   (|Set(aPartition)| >= 1)
 * Postcondition: (Return value = f) and
 *   (forall s : s in Set(aPartition) : Value(s) <= f) and
 *   ((|Set(aPartition)| = 1) =>
 *      (forall s : s in Set(aPartition) : Value(s) = f))
 */

The efficiency of a Branch and Bound algorithm depends on the
tightness of the bound returned by the Bound() function.


3.4  Approaches  to  Parallel  Implementation 

Any parallel approach to the best-first search strategy can be used to
implement a parallel Branch and Bound algorithm. Many approaches that
have already been described and analyzed (for example, see [6, 9]) can be
implemented as programming templates using the components described in
section 3.3.

One of the simplest centralized strategies is the master-and-slave
strategy. One processor is assigned the role of “master” and stores the list of
unexpanded partitions. Other processors are “slaves” that expand the
partitions sent by the “master” process, calculate their bounds and return the
results to the “master”. Whenever a “slave” becomes idle, the
“master” selects the best partition from the list of unexpanded partitions
and sends it to the “slave” process. Since in this strategy several partitions
are expanded at once, the parallel implementation may expand nodes that
would not be expanded by a sequential algorithm.

The performance of the master-and-slave strategy is limited by the fact
that a message is exchanged between the “master” and a “slave” process
for each partition. This factor can affect the scalability of the parallel
implementation. Several small modifications to the strategy can be made to
improve performance:

•  Together  with  the  partition  to  be  expanded,  the  “master”  process 
sends  the  best  currently  known  solution,  thus  delegating  part  of  the 
pruning  to  the  “slave”  and  reducing  the  number  of  messages  in  the 
system. 

•  Another  way  to  reduce  the  number  of  messages  in  the  system  is  to 
allow  “slave”  processes  to  expand  partitions  down  to  a  certain  level  of 
the  search  graph. 

•  The master-and-slave implementation of a problem can be composed
with some heuristic algorithm for solving the problem. Solutions
found by the heuristic process can then be used to eliminate some paths
of the search graph and avoid their expansion.

3.5  Implementations 

The master-and-slave strategy and several of its modifications were
implemented in C with NX and C with PVM. Since the source code for the
NX and PVM implementations is very similar, we present and discuss
the PVM implementation only.

The user is required to define data type Partition_t, several functions
on that data type, and global data. In addition to the archetype’s
component functions, the user has to define procedure FreePartition(), which
deallocates the memory that was allocated for a partition, and several
platform-dependent communication procedures.

/* Archetype’s Components */

int Branch(Partition_t aPartition,
           Partition_t *subPartitions);
float Bound(Partition_t aPartition);
int IsSolution(Partition_t aPartition);

/* Memory Management Procedure */

void FreePartition(Partition_t aPartition);

/* Communication Procedures */

void SendGlobalData(long msgType, long node);
void ReceiveGlobalData(long msgType);
void SendPartition(Partition_t aPartition,
                   long msgType, long node);
void ReceivePartition(Partition_t *aPartition,
                      long msgType);


Slave 

Until  a  termination  message  is  received  from  the  “master”  process,  a  “slave” 
receives  a  partition  to  expand,  divides  it  into  several  subpartitions  using 
function  BranchQ,  calculates  the  bounds  for  each  of  the  subpartitions,  and 
sends  the  results  back  to  the  master  process,  together  with  a  request  for 
another  partition. 


void Slave(int MaxNumOfSubpartitions)
{
  Partition_t p, *subpartitions;
  float bound, solution;
  int terminate = 0;
  int i, n;
  int master, dummy;
  int bufid, bytes, msg_type, tid, mytid;

  /*** Initialize local variables ***/
  mytid = pvm_mytid();
  master = pvm_parent();
  subpartitions = (Partition_t *)calloc(MaxNumOfSubpartitions,
                                        sizeof(Partition_t));

  /*** receive global data ***/
  pvm_recv(master, MSG_GLOBAL_DATA);
  ReceiveGlobalData(MSG_GLOBAL_DATA);

  while (!terminate) {
    /*** wait for the next message to arrive ***/
    bufid = pvm_recv(master, -1);
    pvm_bufinfo(bufid, &bytes, &msg_type, &tid);

    switch (msg_type) {
    case MSG_TERMINATE:           /* termination detection */
      pvm_upkint(&dummy, 1, 1);
      terminate = 1;
      break;

    case MSG_PARTITION:           /* a partition received */
      pvm_upkfloat(&solution, 1, 1);
      ReceivePartition(&p, msg_type);
      n = Branch(p, subpartitions);
      for (i = 0; i < n; i++) {
        bound = Bound(subpartitions[i]);
        if (IsSolution(subpartitions[i]))
          msg_type = MSG_A_SOLUTION;
        else
          msg_type = MSG_PARTITION;
        pvm_initsend(PvmDataDefault);
        pvm_pkfloat(&bound, 1, 1);
        SendPartition(subpartitions[i],
                      mytid*10+msg_type, master);
        pvm_send(master, msg_type);
        FreePartition(subpartitions[i]);
      }
      /* request another partition from the master */
      pvm_initsend(PvmDataDefault);
      pvm_pkint(&dummy, 1, 1);
      pvm_send(master, MSG_REQUEST);
      FreePartition(p);
      break;
    }
  }
}

Master

In the following implementation, the “master” process takes as an argument
an array of task ids of “slave” processes. It assumes that initially all “slave”
processes are idle and are waiting for a partition to be sent. After global
information is transferred to the slave processes, the “master” sends off the
first partition. Depending on the type of the message received, the “master”
takes various actions: it registers a received solution, inserts a received
partition into the list of unexpanded partitions, sends another partition to the
“slave” process, or registers the “slave” process as idle. Whenever the received
solution is better than the current best solution, the “master” removes
partitions with lower bounds from the list of unexpanded partitions. When all
“slave” processes are registered as idle and the list of unexpanded partitions
is empty, the “master” process sends termination messages
to all “slave” processes and terminates.

Partition_t Master(Partition_t aPartition,
                   int numWorkers, int *Workers)
{
  Node_t *Root = NULL;
  Partition_t partition;
  Partition_t aSolution;
  float bound, solution = -1.0;
  int found_a_solution = 0;
  int first, last, numIdleWorkers, dummy;
  int terminate = 0, i;
  int msg_type, bufid, bytes, tid;

  /*** Initialize local variables ***/
  first = 0;
  last = numWorkers-1;
  numIdleWorkers = numWorkers;

  /*** Send global data off to all workers ***/
  for (i = 0; i < numWorkers; i++) {
    pvm_initsend(PvmDataDefault);
    SendGlobalData(MSG_GLOBAL_DATA, Workers[i]);
    pvm_send(Workers[i], MSG_GLOBAL_DATA);
  }

  /*** Send the first partition off to the first worker ***/
  pvm_initsend(PvmDataDefault);
  pvm_pkfloat(&solution, 1, 1);
  SendPartition(aPartition, MSG_PARTITION, Workers[first]);
  pvm_send(Workers[first], MSG_PARTITION);
  first = (first+1) % numWorkers;
  numIdleWorkers--;

  /*** continue until there are no more partitions to expand
   *** and all workers are idle ***/
  while ((!isEmpty(&Root)) || (numIdleWorkers < numWorkers)) {
    /*** wait for a message to arrive ***/
    bufid = pvm_recv(-1, -1);
    pvm_bufinfo(bufid, &bytes, &msg_type, &tid);
    switch (msg_type) {

    case MSG_A_SOLUTION:  /* message contains a solution */
      pvm_upkfloat(&bound, 1, 1);
      ReceivePartition(&partition, tid*10+msg_type);
      if ((!found_a_solution) ||
          ((found_a_solution) && (bound > solution))) {
        if (found_a_solution)
          FreePartition(aSolution);
        aSolution = partition;
        solution = bound;
        RemoveBadNodes(&Root, solution);
        found_a_solution = 1;
      }
      else
        FreePartition(partition);
      break;

    case MSG_REQUEST:  /* message contains a request for work */
      pvm_upkint(&dummy, 1, 1);
      if (!isEmpty(&Root)) {
        partition = RemoveFirst(&Root);
        pvm_initsend(PvmDataDefault);
        pvm_pkfloat(&solution, 1, 1);
        SendPartition(partition, MSG_PARTITION, tid);
        pvm_send(tid, MSG_PARTITION);
        FreePartition(partition);
      }
      else {
        /* no work available: register the worker as idle */
        last = (last + 1) % numWorkers;
        Workers[last] = tid;
        numIdleWorkers++;
      }
      break;

    case MSG_PARTITION:  /* message contains a partition */
      pvm_upkfloat(&bound, 1, 1);
      ReceivePartition(&partition, tid*10+msg_type);
      if ((!found_a_solution) ||
          ((found_a_solution) && (bound > solution))) {
        if (numIdleWorkers > 0) {
          /* hand the partition directly to an idle worker */
          pvm_initsend(PvmDataDefault);
          pvm_pkfloat(&solution, 1, 1);
          SendPartition(partition, MSG_PARTITION,
                        Workers[first]);
          pvm_send(Workers[first], MSG_PARTITION);
          first = (first+1) % numWorkers;
          numIdleWorkers--;
          FreePartition(partition);
        }
        else
          InsertNode(&Root, partition, bound);
      }
      else
        FreePartition(partition);
      break;
    }
  }

  /*** Send "terminate" message to all workers ***/
  for (i = 0; i < numWorkers; i++) {
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&terminate, 1, 1);
    pvm_send(Workers[i], MSG_TERMINATE);
  }

  DeleteList(&Root);
  return aSolution;
}

3.6  Applications 

3.6.1  Zero-One  Knapsack 

Problem Description: Given a knapsack of capacity C, and n objects
with non-zero weights $w_i$ and non-zero values $v_i$, find a collection of objects
to be put into the knapsack that maximizes its total value, or

$$\text{maximize } \sum_{i=0}^{n-1} v_i x_i \quad \text{subject to } \sum_{i=0}^{n-1} w_i x_i \le C \quad \text{where } x_i \in \{0, 1\}$$

where $x_i = 1$ means that object i is included in the knapsack, and $x_i = 0$
means that it is not included in the knapsack.


Components:  The  problem  being  solved  is  defined  by  the  values  of  the 
following  global  variables: 

float Capacity;                   /* capacity of the knapsack */
int numObjects;                   /* number of objects */
float ObjectWeights[numObjects];  /* objects' weights */
float ObjectValues[numObjects];   /* objects' values */
/* the objects are ordered in descending order according to the
 * value-density */

Data type: A partition is defined by the number of objects about which
the decision has been made, that is, by the number of variables $x_i$
whose values are fixed at 1 or 0, and by their values. Therefore, we
define the data type Partition_t as follows:

typedef struct {
  int k;         /* number of fixed variables */
  char x[k];     /* values of the variables, 1 or 0 */
  float value;   /* sum: i in 0..(k-1): x[i]*ObjectValues[i] */
  float weight;  /* sum: i in 0..(k-1): x[i]*ObjectWeights[i] */
} Partition_t;

Predicate and Ghost Functions: The objective function Value(S) is
defined only for partitions S such that S.k = numObjects:

$$Value(S) = \sum_{i=0}^{numObjects-1} S.x[i] \cdot ObjectValues[i]$$

The set of feasible solutions in a given partition P, Set(P), is formed
by all possible combinations of numObjects values $y_i$ such that:

$$(\forall i : 0 \le i < numObjects : y_i \in \{0, 1\}) \text{ and } (\forall i : 0 \le i < P.k : y_i = P.x[i]) \text{ and } \sum_{i=0}^{numObjects-1} y_i \cdot ObjectWeights[i] \le Capacity$$

Finally, function Metric(P) is defined as follows:

$$Metric(P) = numObjects - P.k$$


Procedures: In order to calculate an upper bound on the value of the
knapsack in partition P, the cheesecake problem is solved for objects
P.k + 1, ..., numObjects with capacity Capacity - P.weight:

float Bound(Partition_t aPartition) {
  float value, capacity;
  int i;

  value = 0.0;
  capacity = Capacity - aPartition.weight;
  for (i = aPartition.k;
       (capacity > 0.0) && (i < numObjects); i++)
    if (ObjectWeights[i] < capacity) {
      /* the whole object fits */
      value += ObjectValues[i];
      capacity -= ObjectWeights[i];
    }
    else {
      /* take only the fraction of the object that fits */
      value += ObjectValues[i]*capacity/ObjectWeights[i];
      capacity = 0.0;
    }
  return (value+aPartition.value);
}

int isSolution(Partition_t aPartition) {
  return (aPartition.k == numObjects);
}


The function Branch() divides the partition P into at most two
subpartitions by making a decision about the object P.k. In one subpartition
the object P.k is put into the knapsack, in the other one it is left out:

int Branch(Partition_t aPartition, Partition_t subpartitions[2]) {
  /* copy the values of the fixed variables from aPartition */
  subpartitions[0] = aPartition;
  subpartitions[1] = aPartition;

  /* fix value of (aPartition.k)-th variable */
  subpartitions[0].k = aPartition.k + 1;
  subpartitions[1].k = aPartition.k + 1;

  /* do not put the object in the knapsack */
  subpartitions[0].x[aPartition.k] = 0;

  /* put the object into the knapsack (if possible) */
  if (ObjectWeights[aPartition.k] <=
      Capacity - subpartitions[1].weight) {
    subpartitions[1].x[aPartition.k] = 1;
    subpartitions[1].value += ObjectValues[aPartition.k];
    subpartitions[1].weight += ObjectWeights[aPartition.k];
    return 2;
  }
  else
    return 1;
}
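The knapsack components can be exercised on a small instance without any parallel machinery. The following minimal sketch uses a plain depth-first search with bound-based pruning instead of the best-first, heap-based skeleton of section 3.2, because that keeps the example short. The instance data, the constant NUM_OBJECTS and the names search and SolveKnapsack are hypothetical and introduced only for this illustration.

```c
#define NUM_OBJECTS 4

/* Hypothetical instance, already ordered by descending value density
 * (value/weight = 5, 4, 3, 2). */
static const float Capacity = 10.0f;
static const float ObjectWeights[NUM_OBJECTS] = {  2.0f,  5.0f,  4.0f,  6.0f };
static const float ObjectValues[NUM_OBJECTS]  = { 10.0f, 20.0f, 12.0f, 12.0f };

typedef struct {
    int   k;               /* number of fixed variables       */
    char  x[NUM_OBJECTS];  /* values of the fixed variables   */
    float value;           /* total value of chosen objects   */
    float weight;          /* total weight of chosen objects  */
} Partition_t;

/* Upper bound: fill the remaining capacity greedily, allowing a
 * fraction of the first object that does not fit completely. */
static float Bound(const Partition_t *p)
{
    float value = p->value;
    float capacity = Capacity - p->weight;
    int i;

    for (i = p->k; capacity > 0.0f && i < NUM_OBJECTS; i++) {
        if (ObjectWeights[i] <= capacity) {
            value += ObjectValues[i];
            capacity -= ObjectWeights[i];
        } else {
            value += ObjectValues[i] * capacity / ObjectWeights[i];
            capacity = 0.0f;
        }
    }
    return value;
}

/* Depth-first branch and bound; *best holds the best value found so far. */
static void search(const Partition_t *p, float *best)
{
    Partition_t child;

    if (p->k == NUM_OBJECTS) {            /* isSolution */
        if (p->value > *best)
            *best = p->value;
        return;
    }
    if (Bound(p) <= *best)                /* prune: cannot beat best */
        return;

    /* Branch 1: put object k into the knapsack, if it fits. */
    if (p->weight + ObjectWeights[p->k] <= Capacity) {
        child = *p;
        child.x[child.k] = 1;
        child.value  += ObjectValues[child.k];
        child.weight += ObjectWeights[child.k];
        child.k++;
        search(&child, best);
    }

    /* Branch 2: leave object k out. */
    child = *p;
    child.x[child.k] = 0;
    child.k++;
    search(&child, best);
}

/* Returns the optimal total value for the instance above. */
float SolveKnapsack(void)
{
    Partition_t root = { 0, {0}, 0.0f, 0.0f };
    float best = -1.0f;

    search(&root, &best);
    return best;
}
```

For this instance the optimum is reached by the second and third objects (weight 9, value 32). The subtree that excludes both of the first two objects is pruned because its bound, 24, cannot beat a solution that has already been found.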

Performance Results: Figure 3.1 compares the execution time of the
sequential zero-one knapsack program with the execution time of the
master-and-slave strategy (ms curve) and of the master-and-slave strategy in which the
“master” process sends the best known solution together with the partition
to expand (msl curve). The measurements were taken on the Touchstone
Delta for 10,000 objects with uniformly distributed non-zero weights and
values.


Figure  3.1:  Execution  time  of  zero-one  knapsack  program  on  Touchstone 
Delta. 


The initial increase in execution time occurs because with only 2
processors there is only one slave process. The parallel implementation is
then computationally equivalent to the sequential one, but it also pays the
communication overhead. The increase in execution time for more than 20
processors can be explained by two effects: the “slave” processes spend some
time expanding “bad” partitions, and the increasing amount of communication
makes the master process a bottleneck.

Chapter 4


Conclusion 


We have presented two programming archetypes in combinatorics and
optimization. For each archetype the program skeleton, archetype components
and several implementations were given. For each archetype an example
application from the archetype’s domain was presented to illustrate how the
archetype can be used.

Once the template for an archetype had been written, the effort required
to develop a parallel application decreased significantly. In addition to
the examples presented in this report, several other example applications
were developed: Manhattan Skyline, Nearest Neighbor, Traveling Salesman
Problem, and Zero-One Knapsack with two knapsacks.

Several  directions  for  further  development  of  the  discussed  Archetypes 
present  themselves: 

• A software template can be written for the Control Flow approach to
parallel implementation of Divide-and-Conquer algorithms. The user
might be required to develop the sequential program in a specific fashion,
so as to simplify the parallelization step.

• By using the Divide-and-Conquer archetype presented in chapter 2 the
user can reduce the effort required for developing a parallel
Divide-and-Conquer algorithm by developing and debugging the sequential
program first. However, the scalability of the presented approach is far
from perfect. A modified Divide-and-Conquer archetype with a more
scalable parallel implementation would be even more useful. Such an
implementation is presented and discussed in [4].

• A performance model for the Branch and Bound archetype that can
predict the performance of a parallel implementation, or choose an
efficient implementation for a target architecture, would be a valuable
tool for programmers.

Appendix A

Electronic  Textbook 


The full text of the programming templates in CC++, C and NX, and C
and PVM, together with documentation and several example programs in
addition to the ones presented in this report, will be made publicly
available as part of the electronic textbook on Parallel Programming
Archetypes. Several chapters of the textbook are currently available on the
World Wide Web at http://www.etext.caltech.edu. The structure and contents
of the textbook are described in [1].